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Abstract 

This paper is concerned with tuning friction and temperature in Langevin dy- 
\ > r*) ' namics for fast sampling from the canonical ensemble. We show that near-optimal 

acceleration is achieved by choosing friction so that the local quadratic approxima- 
tion of the Hamiltonian is a critical damped oscillator. The system is also over- 
heated and cooled down to its final temperature. The performances of different 
cooling schedules are analyzed as functions of total simulation time. 



1 Introduction 

q ' 

Y^ We propose a method to accelerate the Langevin approach of sampling from Boltzmann- 

• i-H ■ Gibbs (B-G) distribution. Specifically, consider the following Langevin Stochastic Dif- 

£>y ferential Equations (SDE) 

^ ■ 
O ,' J Mdq = pdt 

dp = -VV(q)dt - cpdt + ^/2c/]3dW ( ! ' 



^. ■ where p,q G M. , M is the mass matrix, V(-) is potential energy, c is a positive semi- 

definite d x d matrix indicating the damping coefficient, j3 £ M + is the inverse of tem- 
perature, and W is a standard Wiener process. 

It is known that the stochastic process defined by (pQ) has an invariant distribution 
r~^| . of Boltzmann-Gibbs distribution (also known as canonical ensemble) defined by: 

O dfi = Z~ 1 e^p(-PH{q,p))dqdp. (2) 

where Z = J T * K d exp(—(3H(q,p))dqdp is the partition function, and H(q,p) = p T M~ 1 p/2+ 
V(q) is the Hamiltonian function. 

When the solution of ([T]) is also geometrically ergodic with respect to // (we refer to 
[lj and [2 J for sufficient conditions on the potential V), it is then natural to use long-time 
trajectories of (HD as approximate samples of B-G distribution. 

One important thing to notice is that being able to sample from B-G enables sampling 
an arbitrary probability density function. The trick is to set V(q) = —(3~ 1 lnir(q), and 
then the marginal distribution on q from B-G will have the density function vr(-). 

This paper is concerned with the following questions: 



• Although the friction parameter c does not affect the invariant distribution, it does 
affect the rate of convergence. How should c be chosen for faster convergence and 
hence accelerated sampling? 

• If sampling from B-G is the objective, the inverse temperature j3 does not need to be 
kept constant over the total simulation time T. How to chose the cooling schedule 
t i—T- f3(t),t G [0,T] in order to minimize the distance between the distribution of 
[q{T),p(T)} and the desired B-G? 

Background: There is no need to repeat the importance of sampling the canonical 
ensembles of complicated systems, which is, however, a known computational challenge 
[31 HI [5]. The nonlinearity of the potential and the curse of dimensionality, for instance, 
make sampling methods slowly convergent. 

Classical sampling approaches include purely statistical methods such as Metropo- 
lis algorithm and importance sampling that are solely for sampling purposes (see for 
instance [B] and references therein for a review and comparison), stochastic molecular 
dynamics (primarily Langevin dynamics), deterministic dynamics plus an external ther- 
mostat (such as Nose-Hoover [7J [8], Berendsen [U] or Andersen [TU] thermostats), Hybrid 
Monte Carlo [11] (which introduces auxiliary dynamics to avoid random walks), etc. We 
also refer to [12] as an example that combines stochastic molecular dynamics and purely 
statistical approach. 

Langevin dynamics adds friction and noise to mechanical equations to model energy 
exchange with a heat bath [13} \TQ I15j . It has been shown in the context of classi- 
cal molecular sampling that both stochastic dynamics and deterministic dynamics with 
thermostats outperform purely statistical methods in convergence rate as the size of 
the system grows (we refer to [16] for a linear alkane molecule). Since overdamped 
Langevin is a special case of Hybrid Monte Carlo [17] , it is not surprising to observe 
cases in which Langevin dynamics is computationally more efficient than purely statis- 
tical methods. Moreover, if the system is stiff or multiscale, existing stiff or multiscale 
Langevin integrators such as SIM [18] or FLAVOR [19] can be directly employed for 
accelerated computation. 

Annealing was first introduced in Simulated Annealing algorithm [20j for global op- 
timization, which can also be viewed as (uniformly) sampling from the set of global min- 
imizers of V. Temperature accelerated dynamics has been proposed in [21] for events 
simulations. The concept there is to raise temperature of the system to make rare 
events occur more frequently, intercept each attempted escape from potential wells and 
extrapolate time to low temperature. Another temperature approach has been used to 
calculate free energy [22]. In that method, overheated auxiliary variables are introduced 
to equilibrate the collective variables faster. We stay with the global annealing approach 
used in Simulated Annealing. 

The proposed perspective of tuning friction and annealing temperature is distinct 
from prevailing accelerated sampling methods, such as conformational flooding [23], 
replica exchange [21], umbrella sampling [25], self-guided MD [26J, hyperdynamics [27] . 
affine invariant ensemble sampler [28J, and many others reviewed in [29], and therefore 



can be used concurrently with many of these methods. While tuning friction is mostly 
restricted to dynamics based methods, annealing may apply to any method that involves 
temperature. Note that temperature is a rather general notion because it can often be 
introduced artificially; for instance, see [30j for an example in which temperature is 
introduced in an MCMC algorithm for Bayesian updating. 

2 Method for friction and temperature accelerated Boltzmann- 
Gibbs sampling 

Background algorithms: Although any Langevin integrator can serve as a back- 
ground algorithm and be tuned and annealed, in our numerical simulations we base on 
the lst-order B-G preserving Geometric Langevin Algorithm (GLA) introduced in [31| . 
which is recapped as follows: 



f. _ p -c n h I l-e-^nh 



-1 



+ hp n ( 3 ) 



Pn+1 = Pn- hVV(q n+ i) 

where h is the timestep length, £ n 's are i.i.d. standard normal random variables, and 
c n = c and f3 n = /3 in absence of friction tuning or temperature annealing. 

The choice of GLA is motivated by its conformal-symplecticity and long-time prop- 
erties |31| . Specifically, under certain conditions, GLA is not only pathwise accurate but 
also convergent towards B-G up to a diminishing numerical error. It is worth mentioning 
that similar properties are shown to hold under weaker conditions for a Metropolized 
version of GLA |12^32j. which can also be tuned and annealed for accelerated samplings. 

For multiscale or stiff systems (where V(q) = Vo(q) + e~ 1 V\(q) for instance), FLA- 
VORS [32] are possible alternative background algorithms that are also conformal- 
symplectic (we also refer to SIMS [18J for quadratic stiff potentials). 

Choice of friction: If V is quadratic (of the form V = q 2 9 ), we show in Appendix 

15.11 that optimal acceleration is achieved by choosing c = 2K% so that all degrees of 
freedom of the harmonic oscillator are critically damped. Based on this observation, we 
heuristically propose to tune the friction c n at each time step of the simulation according 
to the Hessian of the potential V: 

|0(<Zn), 0(^0 
2 /4I, otherwise (4) 



where a is a fixed real parameter, preassigned to handle the case of negative curvature; 
for instance, it could be equal to or to the original value of c. 




Choice of temperature: Annealing has successfully been applied to optimization 
problems |20j. A cooling schedule describes how to choose T(n) = l//3 n as a function 
of n. For optimization based cooling schedules, one requires lim^oo T(i) = 0. We 
refer to [331 EH E2] for general reviews of optimization based cooling schedules, and 
to [36 1 [37] for theoretical bounds on convergence. In this paper we are interested in 
situations where the total number of steps N is finite and fixed, the final temperature 
T(N) = Tf = 1//3 > is strictly positive and is the temperature at which one wishes to 
sample the B-G distribution. 

It is then natural to seek to minimize the distance between the distribution of (qn,Pn) 
and B-G at temperature 1//3 using T(l), . . . ,T(N — 1) as optimization variables. In 
Appendix 15.21 we derive a bound on this distance using transition state theory and 
convergence rates of Markov chains. A numerical minimization of that bound suggests 
the following near-optimal cooling schedule for Tf > (for Tf = we refer to [33] and 
references therein) and TV < iVo (Nq is the number of steps needed for sampling by a 
naive Langevin simulation; see Appendix 15.31 for details): 



Pn 



n 1 



NT 



f 



+ (1 



n 1 



T(n) = l//3 n 



(5) 



where N is the total-number of simulation steps, Tf the temperature at which the Gibbs 
distribution needs to be sampled, and the initial temperature T, > Tf is a free parameter 
chosen to overcome the maximal potential barrier, i.e., Tj S> AV/k (for simplicity we 
let the Boltzmann constant k be equal to one in our setting; AV can be intuitively 
interpreted as the maximum elevation in potential landscape, and we refer to [38J for a 
rigorous definition). 

Friction and temperature accelerated sampling: Put together, annealed and 
tuned GLA (AnnealTuneGLA) for accelerated B-G sampling is the following: 



k n 

('■n 
Pn 

Pn 
Qn+l 
Pn+1 



i^fe 




^fe 



dq 

otherwise 



^0 



N T f t K 1 M>Ti 



(6) 



e Cnh p n + 



1-e- 



Pn 



Qn + hp n 

p n - hVV(q n+1 ) 



Comparing to the background GLA, the distribution of the accelerated trajectory at a 
fixed time is closer to the desired B-G in the total variation sense. A possible exact 
preservation of a near-by distribution is, however, not yet proved for AnnealTuneGLA. 
It is worth mentioning that lst-order GLA is not unconditionally stable, nor is An- 
nealTuneGLA. Therefore, h or a should not be chosen to be too large. 



3 Numerical experiments 




Figure 1: Potential energy landscape. 

Consider a one dimensional nonlinear molecular system consisting of two distinct 
heavy (fixed) atoms and a light atom between them. It is modeled as a single de- 
gree of freedom Hamiltonian system with a Lennard-Jones potential function V(q) = 
(q~ 12 — q~ 6 ) + 5 ((4 — q)~ 12 — (4 — q)~ G ) (Figure [1]). The energy landscape consists of a 
local potential barrier and two potential wells. The attraction due to the right atom is 
larger than the left one. If one starts the dynamics with zero initial momentum and posi- 
tion in the left basin, the asymptotic (long time) position distribution will be a marginal 
of B-G and concentrated in the right basin. Therefore, the expectation of position q at a 
fixed time can be used as an indicator of the convergence rate for this nonlinear system. 



Empirical distribution by GLA at Time=40 




Figure 2: Evolution of the empirical distribution obtained by GLA (Eq. [3]) with c = 0.1. The Markov 
process is converging as the distribution peaks more and more in the right potential basin. Simulation 
is done with a step length h — 0.01 and distributions are approximated empirically by an ensemble of 
10000 trajectories. 

Throughout this section we use parameters /3 = 10, q(0) = 1.1 and p(0) = 0. With an 
arbitrarily chosen c = 0.1, Langevin dynamics integrated with a B-G preserving method 
GLA (Eq. [3| takes more than 200 time units before indiscernible convergence (Figure 
[2]). Enumerating c values for fixed /3 (and hence temperature T), one obtains different 
values of E[g(TotalTime)] for a fixed total simulation time (Figure [3]). This confirms 
that the value of c affects the convergence rate. The optimal fixed value is c = 0.7 in 
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Figure 3: Expectations of position at a fixed time for different frictions obtained by GLA (Eq. [3j) . 
Larger expectation implies better convergence in this problem, and therefore this indicates the relation- 
ship between choice of c and convergence rate. The fixed time is TotalTime=100, step length is h = 0.01, 
expectations are calculated by an empirical average over an ensemble of 1000 trajectories, c values are 
enumerated from 0.01, 0.02, . . ., 1.99, 2.00 and 2.10, . . ., 19.90, 20.00. 

this example. 

Although in practice it is rarely the case that an optimization can be carried out 
beforehand to determine the best value of c for fastest convergence of GLA, we never- 
theless use GLA with the optimal friction c = 0.7 for comparison purposes. We will 
show that TuneGLA outperforms even this optimized GLA, demonstrating that c really 
needs to be tuned locally. 

In Figure |U GLA with c = 0.7 (the optimal fixed value), TuneGLA (GLA with 
friction tuning) which adaptively tunes c but does not anneal (Eq. [6]), AnnealGLA 
(GLA with temperature annealing) which uses an inverse linear cooling schedule (C = 
102/ ) but does not tune c (Eq. I20j) . and AnnealTuneGLAs that tune and anneal with 
respectively a = and a = 0.7 are compared. We observe that tuning friction and 
annealing temperature individually accelerates the convergence, and their effects are 
additive. Therefore, the proposed AnnealTuneGLA has the fastest rate of convergence. 
In addition, here the choice of a = slightly outperforms a = 0.7, which is set to be the 
value of the optimal c. The optimal choice of a has not been investigated. 
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5 Appendix 

5.1 Friction accelerated sampling: analysis of linear systems 

In this section, we will show that with /3 fixed, the choice of c = 2\/k will enable the 
fastest convergence of the following system: 

dq = pdt 

dp = —kqdt — cpdt + adW 



where a = y2c/J3. Assume k is a scalar for the moment. For our purpose, consider 
positive k, because if A; is the system decouples, and if k is negative the system is not 
ergodic and does not admit an invariant distribution. 

The solution to the above linear system can be explicitly written as 

f q(t) = B u (t)q(0) + B 12 (t)p(0) + J* B 12 (t - s)adW s 
\ p(t) = B 21 (t)q(0) + B 22 (t)p(0) + J*B 22 (t-s)o-dW s 



where B(t) is the fundamental matrix defined through the following autonomous ODE 
1 



dB 

dt 



-k 



B, and written in block form to be 



B{t) 



B n (t) B 12 (t) 
5 2 i (t) B 22 (t) 



exp 



-k 



0) 



After calculating out the matrix exponential, the expectation of position writes as 
follows 



Eq(t) 



B n (t)q(0) + B 12 (t)p(0) 



4k 



e*< 



>&=&)* (c-y/#=te 



+ 



C2 



2^/1 
i(-c+Vc 2 -4fc)t _ K-c-V^^ifc)* 



4A: 



/(0) 



sTc 



4k 



-p(0) 



(10) 



Naturally, the expectation approaches as £ — )• +oo. Recall that c and A: are non- 
negative reals. We will show in the following discussion that the maximum speed of 
convergence toward will be achieved when c = 2vfc: 



1. When c 2 — 4k > 0, —c — \/c 2 — 4k < —c+\/c 2 — 4k < and none of the coefficients 
are zero. Therefore the bottleneck for convergence of Bn(t) and B\ 2 (t) will be 
±(-c+Vc*-4k)t 



e2' 



which will be minimized as c J, 4k. 



2. When c 2 - 4k = 0, B n = \e' ct / 2 {2 + ct) and B 



12 



= e~ ct lH. 



3. Whenc 2 — 4/c < 0, define a real number u> = \J 4k — c 2 . JBn = e _c4 / 2 (csin(a;i/2)/cj-|- 
cos(wt/2)) and B\ 2 = e~ ct ' 2 2sm.{u]t /2) / 'ui . Notice cos(oot/2) and sin(wt/2) can not 
be simultaneously zero, and therefore the convergence rate is controlled by e _c *' 2 , 
which will be minimized when c 2 ^ 4k. 

Hence when c = 2vfe this linear system ([7|) converges the fastest. Notice that 
this choice corresponds to a critically damped system (as opposed to overdamped or 
underdamped). 

When the system is linear but multi-dimensional, k can be assumed without loss 
of generality to be a symmetric matrix, and it can be immediately seen that there is 
no theoretical difficulty because one can diagonalize k and choose c diagonal wisely. 
Therefore, any numerical method that calculates the square root of a matrix could work 
here for getting c. There are many possible numerical approaches on square rooting 
matrices, for instance by preconditioning if the matrix has some special structure (which 
is usually the case in molecular systems), or as in [39] or [10], but for consideration of 
conciseness the authors will not discuss this numerical topic. 
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5.2 Temperature accelerated sampling: error bound 

Denote by //jv the distribution of (qn,Pn) using a cooling schedule T(-), by ttt(n) the 
B-G distribution at temperature T(N), and by h the integration time-step. 

Assume that the Markov process of (qn,Pn) satisfies a uniform geometric ergodicity 
condition of the type 

||/^j - KT(i)\\TV < Pi\\/J>i-l - ^T{i)\\TV (11) 

where ktU) is the ergodic measure towards which the process converges if one step update 
from (i — l) th step to i th at temperature T(i) is repeated, pi is the convergence rate, and 
statistical distance is measured in total variation norm, which is defined to be: 

|| m -i/|| tv = bup|m(A)-i/(A)| (12) 

AeB 

where B is the u-algebra of measurable space. 

By repetitive applications of triangle inequality, we derive from Equation (fTTj) that: 

N 

a N < aip 2 + ^2 b jPJ ( 13 ) 

with di = \\fii - 7r T (j)||Ty, h = [KT(i-l) ~ ^T(€)\\tv, and -p% = Ilk=iPk- 

We further assume that < p, t < 1 — ^— e T ^ for some constants ho (stable step 
length limit) and Cy (elevation of potential energy). Beyond transition state theory this 
assumption is motivated bv [31]. [32] . [35] . [38] . [5T] and [2J. 

Using the assumption < T(j — 1) — T(j) <C T(j), we deduce a bound (function of 
the cooling schedule) on the sampling error: 



+n(i-^-*) 



(14) 
where «j = Krtj)[H]/T(j) (aj = 1 for harmonic oscillators). 

5.3 Temperature accelerated sampling: cooling schedules 

Naturally, one would like to minimize the error bound (|14p with respect to T(n)'s. This 
is however difficult because of nonlinearity. Instead, we consider the following subsets of 
cooling schedules (denote by Tj the final temperature at which we want to sample the 
B-G, and by N the number of steps we can afford to employ): 
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Inverse logarithmic cooling: 

log(JV + 1) 

±{n) = If- — ; — - (15) 

v ' J log(n +1) v ' 

This is the most popular schedule for optimization ([36, EZ]; f° r instance, have been 
frequently cited), but truncated at Tf before T — > 0. Recall inverse logarithmic cooling 
is T(n) = log fa +1 \ ; and C is fixed by requiring T(N) = Tf. When N is fixed, there is no 
need to choose any parameter. This schedule will serve as our benchmark. 

Shifted inverse logarithmic cooling: 

1 log(n + 1) 
where C > is the free parameter to be optimized. T(N) is set to be Tf. 

Exponential cooling: 

T(n) = T f e d{N ~ n) = T f C N ~ n (17) 

where C = e c > 1 is the free parameter to be optimized. 

Shifted exponential cooling: 

T(n) =Tf + C-C~ n (18) 

where C > and C > 1 are free parameters. For ease on optimization, we chose 
C = lO^TyC^ so that temperatures 'smoothly' cool to Tf, and are left to optimize 
only one free parameter. 

Linear cooling: 

77 77 

T(n) = -T f + (1 - -)T t (19) 

where Tj > Tf is the free parameter. This is used in |44j for optimization purposes. This 
seemingly too fast cooling schedule does give a small error bound in typical cases (see 
below) . 



Inverse linear cooling: 



*w = i/(££ + ci-£)£ I po) 



where T{ > Tf is the free parameter. Instead of linearly interpolating the temperature, 
this linearly interpolates f$ which is the inverse of temperature to ensure more steps at 
low temperatures. 
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Optimal error bound: We optimize error bounds (|14|) for different total numbers of 
steps (iV's) with respect to the cooling schedules described above. As indicated by Table 
[H the optimal schedule depends on the size on the total simulation time via N (to be 
precise, the ratio between Nh and the mixing time of the original system). Unless N 
is too small or too large, optimal inverse linear cooling produces a small error bound, 
optimal linear and exponential coolings have close performances as well, and all three 
optimal cooling schedules are similar. If the number of steps is too small, B-G will 
not be approximated well by any cooling schedule, and it is better to use the trivial 
schedule of constant temperature. If the number is instead too large (usually not the 
case of interest because accelerated sampling is desired), most types of cooling schedules 
will yield small errors, and surprisingly, shifted exponential cooling outperforms inverse 
logarithmic cooling, which is a popular cooling schedule for large N. 



N 


Constant 
(no cooling) 


Inverse log 
(benchmark) 


Shifted inverse log 


Exp 


Shifted exp 


Linear 


Inverse linear 


200 


0.896 


1.304 


0.950 


0.896 1 


0.896 1 


0.896 1 


0.896 1 


600 


0.718 


0.560 


0.752 


0.372 2 


0.718 


0.365 2 


0.368 2 


1000 


0.575 


0.325 


0.597 


0.266 3 


0.346 


0.267 3 


0.265 3 


2000 


0.331 


0.142 


0.336 


0.153 4 


0.161 


0.155 4 


0.151 4 


5000 


0.063 


0.047 


0.064 


0.046 5 


0.028 


0.047 5 


0.046 5 



1 : Achieved by the limiting case of almost constant temperature 

2,3,4,5. Achieved by almost the same linear-alike optimizers within each row 

Table 1: Optimal error bound for different cooling schedules given total steps N. Within each row, bold 
indicates the minimum error bound. Different values of N are chosen to represent regimes of very small, 
small, medium, large, very large N's, in the sense of being compared to the total mixing steps which in 
this case renders the error bound 0.5 with a constant cooling and is N ~ 1250. 

In these experiments, Tj = 20, Cy = 150, ho/h = 1, and ctj = 1. In this typical 
setting Tf/Cy is small and the B-G distribution is concentrated in potential wells, h is 
close to ho, and ay ss 1. If the Tf/Cy is large, however, the optimization suggests not 
to anneal (result not shown). Optimization is done using MATLAB command fmincon. 

Numerical validation on choices of cooling schedule: These cooling schedules 
have been implemented on the example in Section [3l We did not optimize cooling 
schedules with respect to free parameters but used a heuristic/generic constant instead. 
Error on the empirical expectation of position has been investigated for each schedule in 
Figure [5j The ranking of different types of schedules depends on total simulation time 
and agrees with theoretical prediction (except for large total simulation times which are 
dominated by numerical error accumulation). 

In addition to Figure and the above discussion that compare cooling schedules for 
different total simulation times, we fix total time and show time dependent errors of dif- 
ferent schedules in Figure Here total simulation time is 30 and we are in the medium 
N regime. Inverse linear cooling indeed has better performances, followed closely by 
linear cooling, both consistent with the theoretical analysis. Rigorously speaking one 
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Figure 5: Errors of representative cooling schedules as functions of total simulation time (hence of 
total simulation step N too). Errors are calculated by |xf S-j=i8n ~ Eg(oo)|, where M — 10000 is 
the total number of independent trajectories, q % N is the TVth step position of the ith trajectory, TV ■ h 
is the total simulation time and the step length h — 0.01. Eq(co) is well approximated by empirical 
average of an ensemble of 20000 TuneGLA trajectories at total simulation time of 300. Constants used 
in cooling schedules are: Shifted inverse log: C = 0.017/, Exp: C = 1.5, Shifted exp: T(l) = 27/, 
Linear: C = 27/ , Inverse linear: C — 107/ . Basically all settings are the same as in Section [3] except 
for total simulation time and cooling schedule used. Total simulation time is enumerated from 5 to 100 
with an increment of 1. 

should compare cooling schedules only towards the end of the simulation, because dif- 
ferent cooling schedules are at different temperatures in the middle of the simulation; 
however, the superiority of inverse linear cooling is in fact exhibited throughout the 
simulation. 

These numerical experiments and theoretical bounds indicate that inverse linear 
cooling is ranked at the top. It is worth pointing out that although annealing accelerates 
convergence significantly, one has to choose a priori parameters (in most of our cases, 
total simulation step N and constant C or Tj). This issue usually needs a case-by-case 
investigation, but Cy (if known) could be used in conjunction with the error bound to 
determine N and C. 
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Figure 6: Comparison of errors of TuneGLA with c adaptively tuned and AnnealTuneGLA with 
different cooling schedules. Again, TuneGLA uses a — 0.7, total simulation time=30 is fixed, and all 
other settings are the same as in Figure \S\ and U too. 
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